Advances in imaging equipment and automation have led to an overabundance of data on the functioning of the brain. Technologies today can sample brain activity from a large number of neurons in a large region while organisms are actively behaving. For example, by simultaneously recording the electrical activity of every neuron of the mouse brain over an extended period of time, the amount of data generated will create completely new paradigms for biology, that will require the development of tools to extract value from such unprecedented amount of information.

In this Notebook, we use PySpark and the Thunder project, which is developed on top of PySpark, for processing large amounts of time series data in general, and neuroimaging data in particular. We will use these tools for the task of understanding some of the structure of Zebrafish brains, which is a typical (and simple) example used in Neuroimaging. Using Thunder, we will cluster different regions of the brain (representing groups of neurons) to discover patterns of activity as the zebrafish behaves over time.

Note: Please, use the documentation for the Thunder API to learn the details of function calls!

Goals

The main goals of this notebook are:

  1. Learn about Thunder and how to use it
  2. Revisit the K-Means algorithm and the method for choosing K
  3. Learn alternative approaches to improve the results

Steps

  1. In section 1, we go though some background concepts that are used in this notebook.
  2. Next, in section 2, we will get familiar with Thunder, its methods and its data types, by working on some simple tasks.
  3. Finally, in section 3, we will build a model to cluster the neurons of a zebrafish based on their behavior. In this step, we will learn about how to use K-Means when the value of K is unknown. Finally, some tricks to improve the results are introduced.

1. Background concepts

In this section, we cover the terminology and the concepts that constitute the domain knowledge for this notebook.

As it should be well-known, a pixel is a combination of "picture element": digital images can be modeled as simple 2-dimensional (2D) matrices of intensity values, and each element in the matrix is a pixel. In color images, a pixel contains values of red, green, and blue channels. In a grayscale image, the three channels have the same value, such that each pixel is reduced to be a single value.

A single 2D image is not nearly enough to express 3D objects, which use a voxel, representing a value of the 3D image on a regular grid in a three-dimensional space. A possible technique to work on 3D images is to acquire multiple 2D images of different slices (or planes, or layers) of a 3D object, and stack them one on top of each other (a z-stack). This ultimately produces a 3D matrix of intensity values, where each value represents a volume element or voxel.

This z-stack image has 4 layers. A point is a voxel. It can be determined by the layer's index and the position in that layer.

In the context of the Thunder package, we use term image for 3D-image or stack image. Thunder uses Image type to represent 3D-image. Each Image object is a collection of either 2D images or 3D volumes. In practice, it wraps an n-dimensional array, and supports either distributed operations via Spark or local operations via numpy , with an identical API.

Stack-images can represent 3D objects, but it can be difficult to take the temporal relationship of the images into account. To do that, we need another data structure that shows the changes of voxels over time. In the Thunder package, the internal Series type can be used exactly for this purpose. Each Series is a 1D array such that each element is a value of the voxel at a timestamp.

The most common series data is time series data, in which case the index is time and each record is a different signal, like a channel or pixel.

We now have sufficient material to start playing with Thunder !!!

2. Let's play

Well, wait a second before we play... Remember, we're going to use Spark to perform some of the computations related to this Notebook. Now, when you spin a Zoe Notebook application (this comment is valid for students at Eurecom), you'll gain access to an individual, small Spark cluster that is dedicated to your Notebook. This cluster has two worker machines, each with 6 cores. As such, a good idea to obtain smooth performance and a balanced load on the workers, is to repartition your data (i.e., the RDDs you use to represent images or time series).

In this Notebook we expect students to take care of repartitioning, and such care will be compensated by bonus points.

2.1. Play with Image objects

a. Loading image data

Both images and series can be loaded from a variety of data types and locations. You need to specify whether data should be loaded in 'local' mode, which is backed by a numpy array, or in 'spark' mode, which is backed by an RDD by using the optional argument engine. The argument engine can be either None for local use or a SparkContext for` distributed use with Spark.

import thunder as td

# load data from tif images
data = td.images.fromtif('/path/to/tifs')

# load data from numpy-arrays
data = td.series.fromarray(somearray)
data_distributed = ts.series.fromarray(somearray, engine=sc)

We can load some example image data by:

In [1]:
import thunder as td
import numpy as np

# load some example image data
image_data = td.images.fromexample('fish', engine=sc)

# print the number of images
print(image_data.count())
20

b. Inspecting image data

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt

# import two function to draw images easier
from showit import image as draw_image
from showit import tile as draw_tile

print("Shape of the data:", image_data.shape)

first_image = image_data.first() # get the values of Image object
# or first_image = image_data[0] # get the Image object

print("Shape of the data of the first image:", first_image.shape)

print("Data of the first image:", first_image)


# draw the first layer of the first image
draw_image(first_image[0])

# draw all layers of the first image
draw_tile(first_image)

# we can use index slices to take images
samples = image_data[0:6]
Shape of the data: (20, 2, 76, 87)
Shape of the data of the first image: (2, 76, 87)
Data of the first image: [[[26 26 26 ..., 26 26 26]
  [26 26 26 ..., 26 26 26]
  [26 26 26 ..., 27 27 26]
  ..., 
  [26 26 26 ..., 27 27 26]
  [26 26 26 ..., 27 26 26]
  [25 25 25 ..., 26 26 26]]

 [[25 25 25 ..., 26 26 26]
  [25 25 25 ..., 26 26 26]
  [26 26 26 ..., 26 26 26]
  ..., 
  [26 26 26 ..., 26 26 26]
  [26 26 26 ..., 26 26 26]
  [25 25 25 ..., 26 26 26]]]

One of the advantages of working in Python is that we can easily visualize our data stored into Spark RDDs using the Matplotlib library. Function draw_image and draw_tile that take advantages of Matplotlib are examples.

From the result above, the shape of the loaded data is (20, 2, 76, 87). It means we have total 20 3D images objects. Each image has 2 layers, each layer has size 76x87.

Note that, although data is not itself an array (it can be a kind of RDD), we can index into it using bracket notation, and pass it as input to plotting methods that expect arrays. In these cases, the data will be automatically converted.

Question 1

a) Use the function `imgshow` from matplotlib to plot each layer of the first image in `image_data`. b) Discuss the choice of parameters you use for the method `imgshow`
In [3]:
img = image_data.first() 
# or:
# img = image_data[1]
plt.figure(figsize=(15,6))
# show the first layer
plt.subplot(1, 2, 1)
plt.axis('off')
plt.imshow(img[0], interpolation='nearest', cmap='gray')
plt.title("Layer 0 of the first image")

# show the second layer
plt.subplot(1, 2, 2)
plt.axis('off')
plt.imshow(img[1], interpolation='nearest', cmap='gray')
plt.title("Layer 1 of the first image")
plt.show()
In [4]:
#COLOR MAP PARAMETER
plt.figure(figsize=(15,16))
# show the first layer+
plt.subplot(1, 4, 1)
plt.axis('off')
plt.imshow(img[1], cmap='seismic')
plt.title("Seismic cmap")
# show the second layer in colours
plt.subplot(1, 4, 2)
plt.axis('off')
plt.imshow(img[1], cmap = 'hot')
plt.title("Hot cmap")
# show the second layer in colours
plt.subplot(1, 4, 3)
plt.axis('off')
plt.imshow(img[1], cmap = 'Blues')
plt.title("Blues cmap")
# show the second layer in colours
plt.subplot(1, 4, 4)
plt.axis('off')
plt.imshow(img[1], cmap= 'spectral')
plt.title("Spectral cmap")
plt.show()
We first used the "classical" grayscale view to compare the two first images. Then we tested several parameters. For the colormap, it seems that 'spectral' cmap gives a more interesting image where we can see the active parts of the brain more precisely than for instance with a hot cmap.
In [5]:
#INTERPOLATION PARAMETER
plt.figure(figsize=(15,6))
# show the first layer+
plt.subplot(1, 4, 1)
plt.axis('off')
plt.imshow(img[1], interpolation='nearest', cmap= 'spectral')
plt.title("Nearest interpolation")

# show the second layer in gray 
plt.subplot(1, 4, 2)
plt.axis('off')
plt.imshow(img[1], interpolation='gaussian', cmap= 'spectral')
plt.title("Gaussian interpolation")

# show the second layer in colours using gaussian interpolation
plt.subplot(1, 4, 3)
plt.axis('off')
plt.imshow(img[1], interpolation='lanczos', cmap= 'spectral')
plt.title("Lanczos interpolation")


# show the second layer in colours
plt.subplot(1, 4, 4)
plt.axis('off')
plt.imshow(img[1], interpolation='sinc', cmap= 'spectral')
plt.title("Sinc interpolation")
plt.show()
Playing with the interpolation parameter directly influences the way our image is smoothed. As we can see pixels are still visible with the 'nearest' interpolation. If we are looking for good contrast and minimal blur, 'sinc' interpolation seems to be a good choice. This is suitable, of course, only for visualization: for analysis, the 'nearest' interpolation is better because it will not create new values for the voxels (to smooth the image) and the colormap is not important.

Then, we can perform operations that aggregate information across images.

Question 2

Calculate the standard deviation across all images you have in `image_data` (that is, our dataset). To clarify, let's focus on an individual layer (say the first layer of each image). For every `voxel`, compute the standard deviation of its values across different images for the same layer. Visualize the standard deviation you obtain, for example concerning a single layer (as before, say the first layer).

HINT 1
to avoid wasting time and energy, make sure you lookup for methods that could help answer the question from the Thunder documentation.

HINT 2
We can also use function draw_image(<data>) to plot an image in a simple way instead of using many statements with matplotlib as before.

NOTE
Comment the image you obtain. What does it mean to display the standard deviation across all images in a single layer?

In [6]:
# calculate standard deviation across images
std_imgs = image_data.std()
print("We went from " + str(image_data.shape) + " data to " + str(std_imgs.shape) + " data with this operation")
We went from (20, 2, 76, 87) data to (1, 2, 76, 87) data with this operation
As we can see, from twenty 3D images we calculated one 3D image which represents the variation of the values of our voxels across the 20 initial images: each neuron has 20 states, available in our dataset.
In [7]:
# show the first layer standard deviation 
plt.figure(figsize=(15,6))
plt.subplot(1, 2, 1)
plt.axis('off')
plt.imshow(img[0], interpolation='sinc', cmap= 'spectral')
plt.title("Layer 0 of the first image")
plt.subplot(1, 2, 2)
plt.axis('off')
plt.imshow(img[1], interpolation='sinc', cmap= 'spectral')
plt.title("Layer 1 of the first image")
plt.show()

plt.figure(figsize=(15,6))
plt.subplot(1, 2, 1)
plt.axis('off')
plt.imshow(std_imgs.first()[0], interpolation='sinc', cmap= 'spectral')
plt.title("Layer 0s standard deviation")
plt.subplot(1, 2, 2)
plt.axis('off')
plt.imshow(std_imgs.first()[1], interpolation='sinc', cmap= 'spectral')
plt.title("Layer 1s standard deviation")
plt.show()
- For the regular images : the red areas are where the brain has a high activity (high value for the given voxel)
- For the standard deviation : the red areas are where the brain activity changes most. As we can see, many voxels have a standard deviation of 0: those are the voxels that fall outside of the brain in our images and they should be taken care of in our analysis.

c. Selecting samples of image data

The Images API offers useful methods for working with large image data. For example, in some cases it is necessary to subsample each image, to make sure we can analyze it efficiently.

Question 3

The source code below subsamples image data with different ratios on different dimensions. a) Complete the source code to plot the first layer of the first image. b) What is the shape of `image_data` before and after subsampling?
In [8]:
subsampled = image_data.subsample((1, 5, 5))
# Stride to use in subsampling. If a single int is passed, each dimension of the image
# will be downsampled by this same factor. If a tuple is passed, it must have the same
# dimensionality of the image. The strides given in a passed tuple will be applied to
# each image dimension
plt.figure(figsize=(15,6))
plt.subplot(1,2,1)
plt.imshow(subsampled.first()[0], interpolation='nearest', cmap= 'spectral') 
plt.title("Subsampled image")
plt.axis('off')
plt.subplot(1,2,2)
plt.imshow(subsampled.first()[0], interpolation='sinc', cmap= 'spectral') 
plt.title("Subsampled image interpolated with 'sinc'")
plt.axis('off')
plt.show()
print("Before subsampling:", image_data.shape)
print("After subsampling:", subsampled.shape)
Before subsampling: (20, 2, 76, 87)
After subsampling: (20, 2, 16, 18)
It means we have total 20 3D images objects. Each image has 2 layers, each layer has size 16x18 instead of 76x87, which means we compressed the information.

Note that subsample is an RDD operation, so it returns immediately. Indeed, we know that in Spark you must apply a RDD action to trigger the actual computation.

d. Converting image data

We can also convert an RDD of images to a RDD of series by:

In [9]:
series_data = image_data.toseries()
series_data.cache()
Out[9]:
Series
mode: spark
dtype: uint8
shape: (2, 76, 87, 20)

Question 4

According to your understanding about `Series` objects which was introduced in section 1, what is the shape of `seriesRDD` and its elments ? Comment your results, don't just display numbers.
The shape of the RDD series_data is (2, 76, 87, 20) This means that each series has 2 layers of size 76x87 evolving during a certain timestamp. It also means that we have 20 samples of this evolution over this timestamp.

For a large data set that will be analyzed repeatedly as a Series, it will ultimately be faster and more convienient to save Images data to a collection of flat binary files on a distributed file system, which can in turn be read back in directly as a Series, rather than repeatedly converting the images to a Series object. This can be performed either through a ThunderContext method, convertImagesToSeries, or directly on an Images object, as done below:

series_data.tobinary('directory/data', overwrite=True) #tolocal
ts = td.series.frombinary('directory/data', engine=sc)

We will study about Series object in the next section.

2.2. Play with Serises objects

a. Loading Series data

In this section, we use a sample data to explore Series objects.

# series_data = td.series.fromexample('fish', engine=sc)
series_data = td.series.frombinary(path='s3n://thunder-sample-data/series/fish', engine=sc)

b. Inspecting Series data

Series_data is a distributed collection of key-value records, each containing a coordinate identifier and the time series of a single voxel. We can look at the first record by using first(). It’s a key-value pair, where the key is a tuple of int (representing a spatial coordinate within the imaging volume) and the value is an one-dimensional array.

In [10]:
first_series = series_data.first() # get the values of Series object
#first_series = series_data[0] # get a Series object

print("Shape of series:", series_data.shape)
print("The first series:", first_series)
print("Each element in series has", len(first_series), "values")
# print the 10th value of voxel (0,0,0)
# layer = 0
# coordinator = (0,0) in that layer
print("Value 10th of voxel (0,0,0):", np.array(series_data[0,0,0,10]))
Shape of series: (2, 76, 87, 20)
The first series: [26 26 26 26 26 26 26 25 26 25 25 25 26 26 26 26 26 26 26 26]
Each element in series has 20 values
Value 10th of voxel (0,0,0): 25

The loaded series data is a multi-dimensional array. We can access the values of a voxel in time series by using a tuple as above. In our data, each voxel has 20 values corresponding to 20 states at 20 different times.

c. Selecting Series data

Series objects have a 1D index, which can be used to subselect values.

In [11]:
print("shape of index:", series_data.index.shape)
print("the first element of a subset", series_data.between(0,8).first())
shape of index: (20,)
the first element of a subset [26 26 26 26 26 26 26 25]

Values can be selected based on their index:

In [12]:
print(series_data.select(lambda x: x > 3 and x < 8).index)
print(series_data.select(lambda x: x > 3 and x < 8).first())
[4, 5, 6, 7]
[26 26 26 25]

Question 5

Plot the first 20 values of **all** series objects (that is the values of a voxel) in the series data. This means, on the same plot, you should visualize the values each voxel takes in the first 20 time intervals.
In [13]:
# only select the first 20 states of each object
samples = series_data.between(0,20).tordd().values().collect()

plt.figure(figsize=(15,6))
plt.plot(np.array(samples).T)
plt.show()

Now, another objective we can have is to select specific series objects within the same series data. For example, we can select objects randomly by using function sample.

Question 6

Let's plot a random subset of the data using the method `sample`. Complete the source code below to plot the first 20 values of 30 objects that are selected randomly among those that pass the condition on the standard deviation, using function `sample`.
In [14]:
# select 30 objects randomly which have standard deviation > threshold
# Extract random subset of records, filtering on a summary statistic.
examples = series_data.filter(lambda x: x.std() > 10.0).sample(30)
# only plot first 20 states of each object
plt.figure(figsize=(15,6))
plt.plot(np.array(examples).T)
plt.show()

d. Preprocessing Series data

A Series objects has some methods which can be useful in an eventual preprocessing phase.

For example,center subtracts the mean, normalize subtracts and divides by a baseline (either the mean, or a percentile).

In [15]:
examples = series_data.center().filter(lambda x: x.std() >= 10).sample(50)
plt.figure(figsize=(15,6))
plt.plot(np.array(examples).T)
plt.show()
In [16]:
normalizedRDD = series_data.normalize(method='mean').filter(lambda x: x.std() >= 0.1).sample(50)
plt.figure(figsize=(15,6))
plt.plot(np.array(normalizedRDD).T)
plt.show()

e. Computing statistics about Series data

A Series can be summarized with statistics both within and across images. To summarize across records (the statistic of all voxels at each timestamp), we can do the following:

In [17]:
plt.figure(figsize=(15,6))
plt.plot(series_data.normalize().max()); #max in blue
plt.plot(series_data.normalize().mean()); #mean in green
plt.plot(series_data.normalize().min()); # min in red

To summarize within records, we can use the map method:

In [18]:
#Generate summary rdds
means = series_data.map(lambda x: x.mean())
stdevs = series_data.map(lambda x: x.std())

#Flatten them to arrays
flat_means = means.flatten().toarray()
flat_stdevs = stdevs.flatten().toarray()

print("Means:", flat_means)
print("Length of means (corresponds to 2*76*87):", len(flat_means))
print("Mean of the first series:", flat_means[0])
print("Standard deviation of the first series:", flat_stdevs[0])
Means: [ 25.8   25.85  25.7  ...,  26.    26.    26.  ]
Length of means (corresponds to 2*76*87): 13224
Mean of the first series: 25.8
Standard deviation of the first series: 0.4

means is now a Series object, where the value of each record is the mean across the time series for that voxel.

Note that in the source code above, we use function toarray to return all records to the driver as a numpy array.

For this Series, since the keys correspond to spatial coordinates, we can pack the results back into a local array in driver node.

To look at this array as an image, we can use function draw_image as before.

In [19]:
# we should recover the shape of means before plotting
# draw the stdandard deviations of series that belong to the first layer
draw_image(flat_means.reshape((2, 76, 87)) [0,:,:])
draw_image(flat_stdevs.reshape((2, 76, 87)) [0,:,:])
plt.show()

Note that toarray is an example of a local operation, meaning that all the data involved will be sent to the Spark driver node. In this case, packing the mean is no problem because its size is quite small. But for larger data sets, this can be very problematic. So, it's a good idea to downsample, subselect, or otherwise reduce the size of your data before attempting to pack large image data sets!

f. Identifying correlations

In several problem domains, it may also be beneficial to assess the similarity between a designated signal (time series) and another signal of interest by measuring their correlation. For example, say we have two time series corresponding to the consumption of Coca Cola and Pepsi, it would perhaps be interesting to verify whether behavioural patterns are similar for both brands over time.

Simply as a proof of concept, we shall compare our data to a random signal and we expect that, for a random signal, the correlation should be low. The signal can be stored as a numpy array or a MAT file containing the signal as a variable. Note that the size of the signal must be equal to the size of each Series element.

In [20]:
from numpy import random
signal = random.randn(len(first_series))
print("The correlation of the first element with random signal:" , series_data.correlate(signal).first())

first_element = series_data.first()
corr = series_data.correlate(np.array(first_element)).first()
print("The correlation of the first element with itselft:", corr)
The correlation of the first element with random signal: [ 0.12030014]
The correlation of the first element with itselft: [ 1.]

3. Usecase

3.1. Context

Neurons have a variety of biochemical and anatomical properties. Classification methods are thus needed for clarification of the functions of neural circuits as well as for regenerative medicine. In this usecase, we want to categorize the neurons in a fish brain, based on their behavior. The behavior of a neuron can be expressed by the change of its states. The activities of the brains are captured over time into images.

Neurons have a variety of biochemical and anatomical properties. Classification methods are thus needed for clarification of the functions of neural circuits as well as for regenerative medicine. In this usecase, we want to categorize the neurons in a fish brain, based on their behavior. The behavior of a neuron can be expressed by the change of its states. The activies of the brains are captured over time into images.

In this notebook, we use K-Means, a well known clustering algorithm which is also familiar to you, as it was introduced during the last lecture on anomaly detection.

3.2 Data

The dataset we will use is the time series data which we played with in the previous section. Refer to section 2 if you want to duplicate the code to load such data.

3.3. Building model

a. Importing required modules

In [21]:
%matplotlib inline
import matplotlib.pyplot as plt
from pyspark.mllib.clustering import KMeans, KMeansModel
from matplotlib.colors import ListedColormap

b. Loading & inspecting the data

Question 7

Load example series data from `fish`, then normalize and cache it to speed up repeated queries. Print the dimensional information of the loaded data.
In [22]:
# we must normalize it to get best clustering
data = series_data.normalize(method='mean')

#Since the cluster we are working on has 12 cores available
data.repartition(12)
# cache it to speed up related queries
data.cache()
# check the dimensions of data
print (data.shape)

n_data = data.count()
print (n_data)
(2, 76, 87, 20)
13224

Question 8

When studying the properties of large data set, we often take a small fraction of it. We have many strategies to select this subset, such as selecting randomly, selecting elements that has the standard deviation bigger than a threshold, or mixing the conditions. In this notebook, we will use the second method as a small demonstration.
    In order to choose a good value for the threshold of standard deviation, we should compute the stddev of each series and plot a histogram of a 10% sample of the values.
      Complete the source code below to compute the standard deviation of series in data. Plot the histogram of it and discuss it in details. In your opinion, what should be the best value for the threshold ?
      In [23]:
      # calculate the standard deviation of each series
      # then select randomly 10% of value to plot the histogram
      sub = int(n_data*0.1)
      stddevs = data.map(lambda x : x.std()).sample(sub).flatten().toarray()
      
      # plot the histogram of 20 bins
      plt.figure(figsize=(15,6))
      plt.xlabel('Standard deviation')
      plt.hist(stddevs, 20)
      plt.show()
      
      The following histogram is representative of the global behaviour of our dataset. The first bar stands out, implying that we have a very large amount of voxels (almost 50%) that have a standard deviation of 0: in fact we already identified those voxels as the ones falling outside the brain in our image data. We are interested in keeping data concerning the brain only, and preferably representative data. Hence we will ignore also the data having a very weak variation, probably due to the fact that the voxels represent the skull or very inactive brain zones. Hence we can choose a **threshold of 0.02**: the data with an std above that value seems to correspond to what we are looking for.
      In [24]:
      #Calculating the threshold
      threshold = 0.02
      print("%.2f %% of the initial data selected" %(data.filter(lambda x : x.std() > threshold).count()/n_data *100))
      
      16.84 % of the initial data selected
      

      Question 9

      Extract some samples just to look at the typical structure of the time series data. The objects are selected randomly, and has the standard deviation bigger than the threshold which you picked in question 8. Plot the samples and discuss your obtained figure.
      In [25]:
      # sample 50 objects of the data randomly base on the standard deviation
      examples = data.filter(lambda x: x.std() > threshold).sample(50)
      examples_not = data.filter(lambda x: x.std() <= threshold).sample(50)
      # plot the sample data
      plt.figure(figsize=(15,6))
      plt.subplot(1,2,1)
      plt.plot(np.array(examples).T)
      plt.title("Kept data")
      plt.subplot(1,2,2)
      plt.plot(np.array(examples_not).T)
      plt.title("Data not considered")
      plt.show()
      
      We can clearly see the difference between the data we choose to keep for our analysis and the data that we neglect. The kept data is representative of a certain behaviour with a variation of 0.2 for the extrema, whilst the other one is more chaotic around 0 (variation of 0.1).

      c. Clustering series

      In this section, we will use K-means to cluster the series. In other words, we cluster the voxels based on the their behavior. Currently, we have no clue about how many groups K of neural behavior. To this end, instead of choosing a single value K, we use multiple values, build model with each K and compare the resulting error values. After that, we can choose the best value of K.

      Question 10

      Complete the source below to build multiple models coresponding to multiple values of `K` using algorithm KMeans of Thunder. a) Comment the structure of the code. Precisely, focus on the `for` loop, and state what is parallel and what is not. b) Can you modify the structure of the code such that you use the most of the parallelization capabilities of Spark?
      In [26]:
      from time import time
      
      In [27]:
      # declare the possible values of K
      ks = [5, 10, 15, 20, 30, 50, 100, 200]
      
      # convert series data to rdd of values
      training_data = data.tordd().map(lambda x: np.array(x[1]) ).cache() #is normalized
      
      def buildModels(tdata):
          # declare the collection of models
          models = [] 
      
          # build model for each K and append to models
          for k in ks:
              # initializationMode='k-means||' is the parallelized version of k-means++
              # initializationSteps=2 
              models.append(KMeans.train(tdata, k, maxIterations=100)) # 100 bonne valeur ?
          return models
      t0 = time()
      models = buildModels(training_data)
      print("Finished in %.2f s" %(time()-t0))
      
      Finished in 34.58 s
      
      In [28]:
      from sklearn.cluster import KMeans as KMeans_sk
      ks_p = sc.parallelize([5, 10, 15, 20, 30, 50, 100, 200])
      
      #broadcast the data used in each parallel execution
      bData = sc.broadcast(training_data.collect())
      
      def buildModels0(data):
          # init='k-means++' for non parallel execution of Kmeans
          # n_init corresponds to initializationSteps in the cell above
          models = ks_p.map(lambda x : (KMeans_sk(x, max_iter=100).fit(data.value)))
          return models.collect()
      
      t0 = time()
      models_sk = buildModels0(bData)
      print("Finished in %.2f s" %(time()-t0))
      
      /opt/conda/lib/python3.5/site-packages/sklearn/utils/fixes.py:64: DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() instead
        if 'order' in inspect.getargspec(np.copy)[0]:
      
      Finished in 14.14 s
      
      - In the first case, the MLLIB Kmeans works on an RDD in a parallelized way. However each iteration of the for loop is performed sequentially. - In the second case, we parallelize the execution of the for loop. We take another Kmeans implementation which is not designed for RDDs, so not parallel. The result in the second case is signicantly better.

      d. Testing models & choosing the best one

      Next, we evaluate the quality of each model. We use two different error metrics on each of the clusterings.

      • The first is the sum across all time series of the Euclidean distance from the time series to their cluster centroids.

      • The second is a built-in metric of the KMeansModel object.

      Question 11

      a) Write function `model_error_1` to calculate the sum of Squared Euclidean Distance from the Series objects to their clusters centroids. b) Comment the choice of the error function we use here. Is it a good error definition?
      In [207]:
      from scipy.spatial.distance import cdist
      # calculate the Euclidean distance
      # from each time series to its cluster center and sum all distances
      def model_error_1(data, model):
          centers = model.clusterCenters
          return data.map(lambda x: np.sqrt(np.sum((np.array(centers[model.predict(x)])-np.array(x))**2)))\
                     .sum()
      
      This error is not a good indicator since it decreases monotonically with the increase of K. However at some point the improvement will slow down, and this will give us an indication of what a good value for K would be.

      Question 12

      a) Write function `model_error_2` to calculate the total of similarity of `Series` objects based on how well they match the cluster they belong to, and then calculate the error by inverse the total similarity. b) Similarly to the previous question, comment the choice of the similarity function.
      In [208]:
      # calculate the total of similarity of the model on timeseries objects
      # and calculate the error by inverse the total similarity
      
      # Estimate similarity between a data point and the cluster it belongs to.
      def similarity(centers, p):
          if np.std(p) == 0:
              return 0
          return np.corrcoef(centers[np.argmin(cdist(centers, np.array([p])))], p)[0, 1]
      
      
      def model_error_2(data, model):
          return 1. / data.map(lambda x: similarity(model.centers, x)).sum()
      
      This error is more suited to our model because it measures how much the centroid of a cluster is representative of the points contained in it using the correlation matrix. However since with the increase of K the clusters get smaller, we expect this metric to also monotonically decrease.

      Question 13

      Plot the error of the models along with the different values of K in term of different error metrics above. From the figure, in your opinion, what is the best value for `K` ? Why ?

      1) First two model_error methods

      In [31]:
      from sklearn import linear_model
      # A function to see how our data converges with the increase of K
      def plotEnd(data):
          x = np.array([50,100,200]).reshape(3,1)
          y = np.array(data[-3:]).reshape(3,1)
          regr = linear_model.LinearRegression()
          regr.fit(x, y)
          return ([0,200],regr.predict(np.array([0,200]).reshape(2,1)))
      
      In [32]:
      def testAndPlotTheResult(data, models):
          # compute the error metrics for the different resulting clusterings
          
          # errors of models when using function Sum Square Distance Error (=intra_variance)
          errors_1 = np.asarray([model_error_1(data, models[i]) for i in range(len(models))])
          E1 = errors_1 / errors_1.sum()
      
          (x1,y1) = plotEnd(E1)
          
          # error of models when using similarity
          errors_2 = np.asarray([model_error_2(data, models[i]) for i in range(len(models))])
          E2 = errors_2 / errors_2.sum()
          
          (x2,y2) = plotEnd(E2)
          
          # plot the errors with each value of K
          plt.figure(figsize=(15,6))
          plt.plot(
                ks, E1, 'k-o',
                ks, E2, 'b:v')
          plt.show()
          plt.figure(figsize=(15,6))
          plt.subplot(1,2,1)
          plt.plot(
                ks, E1, 'k-o',
                x1, y1, 'k--')
          plt.subplot(1,2,2)
          plt.plot(
                ks, E2, 'b:v',
                x2, y2, 'b--')
          plt.show()
          
      testAndPlotTheResult(training_data, models)
      
      Since the errors decrease monotonically when we increase K, we cannot expect to take a decision based on a minimum. However, we can consider that our choice of K is correct when bigger Ks do not bring significant evolution to the qulity of the model. Here we hesitate between **K=30 and K=50**: for K>50 the evolution of the error is less significant then before. The dashed lines represenent a behaviour tendency of our clusters for bigger Ks: at some point the impovement step does not increase anymore.

      2) Silhouette score

      In [106]:
      from sklearn.metrics import silhouette_score
      
      In [34]:
      # training data = rdd de points
      # rdd de labels
      from sklearn.metrics import silhouette_score
      SS = []
      X_s = np.array(training_data.collect())
      for mod in models:
          t0=time()
          predictions = np.array(training_data.map(lambda x : mod.predict(x)).collect())
          SS.append(silhouette_score(X_s, predictions))
          print("Done in %.2f" %(time()-t0))
      
      Done in 25.14
      Done in 27.24
      Done in 28.53
      Done in 29.49
      Done in 34.49
      Done in 75.50
      Done in 93.99
      Done in 115.03
      
      In [72]:
      (x1,y1) = plotEnd(SS)
      plt.figure(figsize=(15,6))
      plt.plot(ks, SS, 'k-o',)
      plt.plot(x1, y1, 'k--')
      plt.title('Avg silhouette score')
      plt.show()
      
      Once again we can see that there is no significant evolution above a certain value of K. With this metric, K=50 seems to be a good choice, because afterwards the evolution of the average silhouette score is less significant.

      Determining the optimal $k$ is particularly troublesome for the $k$-Means algorithm because error measures based on distance decrease monotonically as $k$ increases. This arises because when $k$ is increased, each cluster is decomposed into more and more clusters, such that each point becomes closer to its cluster mean. In fact, in the extreme case where $k=N$, each point will be assigned to its own cluster, and all distances are reduced to nil. Cross-validation or using holdout data is also unlikely to be particularly effective in this case.

      To this end, it is often worth assessing a model by measuring its impact on the overall aim of carrying out the clustering. For example, if we are carrying out $k$-means for grouping customers having similar taste and purchase history with the ultimate intent of making recommendations to customers, our objective function should measure how effective the recommendations are (perhaps using holdout data). An appealing aspect of using such a metric is that it is no longer guaranteed to behave monotonically with respect to $k$. We shall investigate this further in Question 20.

      Question 14

      Plot the centroids of the best model. Do you think that the result is good ?
      In [35]:
      bestModel = models[5]
      befbest = models[4]
      nexbest = models[6]
      tripbest = ks[4:7]
      print(tripbest)
      
      [30, 50, 100]
      
      In [202]:
      plt.figure(figsize=(15,6))
      plt.subplot(1,2,1)
      #plt.axis([0,20,-0.20,0.20])
      plt.plot(np.array(examples.normalize()).T)
      plt.title("Representative data with std > threshold")
      plt.subplot(1,2,2)
      plt.axis([0,20,-0.20,0.20])
      plt.plot(np.array(bestModel.centers).T)
      plt.title("Model centers for chosen best K")
      plt.show()
      
      The centroids are representative of the global behaviour of the neurons contained in the cluster. Since our data is not filtered (we have data with very low standard deviation) we can see centroids with low stddev. Overall, the result seems to be coherent with what we could expect to obtain.
      Let us try to use PCA to reduce dimensionality and get a glimpse of the cluster repartition.
      In [49]:
      # Try to plot PCA to visualize
      training_labeled = training_data.map(lambda x : (bestModel.predict(x),x))
      
      # Create structure associating points to their clusters (KMeans of MLLib)
      cluster_appartenance = [_ for _ in range(len(bestModel.centers))]
      for idx in range(len(bestModel.centers)):
          cluster_appartenance[idx]= (training_labeled.filter(lambda x : x[0]==idx).values().collect())
      
      In [50]:
      from sklearn.decomposition import PCA
      from mpl_toolkits.mplot3d import Axes3D
      import matplotlib.cm as cm
      pca = PCA(n_components=2)
      cluster_appartenance_bis = [pca.fit_transform(x) for x in cluster_appartenance]
      
      fig = plt.figure(1, figsize=(12, 12))
      colors = cm.rainbow(np.linspace(0, 1, 30))
      for (idx, c) in zip([x for x in range(30)],colors):
          if idx%5==0:
              plt.scatter([cluster_appartenance_bis[idx][i][0] for i in range(len(cluster_appartenance_bis[idx]))],\
                      [cluster_appartenance_bis[idx][i][1] for i in range(len(cluster_appartenance_bis[idx]))],\
                      c=c)
      plt.show()
      
      /opt/conda/lib/python3.5/site-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
        if self._edgecolors == str('face'):
      
      In [51]:
      from sklearn.decomposition import PCA
      from mpl_toolkits.mplot3d import Axes3D
      import matplotlib.cm as cm
      pca = PCA(n_components=3)
      cluster_appartenance_bis = [pca.fit_transform(x) for x in cluster_appartenance]
      fig = plt.figure(1, figsize=(12, 12))
      colors = cm.rainbow(np.linspace(0, 1, 30))
      plt.clf()
      ax = Axes3D(fig, rect=[0, 0, 1, 1], elev=8, azim=45)
      plt.cla()
      for (idx, c) in zip([x for x in range(30)],colors):
          if idx%5==0:
              ax.scatter(cluster_appartenance_bis[idx][:, 2],\
                         cluster_appartenance_bis[idx][:, 0],\
                         cluster_appartenance_bis[idx][:, 1], c=c)
      plt.show()
      
      /opt/conda/lib/python3.5/site-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
        if self._edgecolors == str('face'):
      
      Unfortunately, the cluster elements seem to be mixed up. However we can notice the presence of several outliers, which confirms the idea that we need to filter our data to remove some noise.

      e. Visualizing the result

      We can also plot an image of labels of neurons, such that we can visualize the group of each neuron.

      Question 15

      Complete the source code below to visualize the result of clustering.
      In [143]:
      # predict the nearest cluster id for each voxel in Series
      labels = bestModel.predict(training_data).collect()
      
      # collect data to the driver
      imgLabels = np.array(labels).flatten().reshape((2, 76, 87))
      
      # consider the voxel of the first layers
      draw_image(imgLabels[0,:,:])
      
      plt.show()
      
      In [37]:
      # Let's visualize for the values of K before and after the value we chose as the best
      labelsbef = befbest.predict(training_data).collect()
      imgLabelsbef = np.array(labelsbef).flatten().reshape((2, 76, 87))
      labelsnex = nexbest.predict(training_data).collect()
      imgLabelsnex = np.array(labelsnex).flatten().reshape((2, 76, 87))
      
      In [38]:
      plt.figure(figsize=(15,6))
      plt.subplot(1,3,1)
      plt.imshow(imgLabelsbef[0,:,:], cmap='gray', interpolation='nearest') 
      plt.title("K = %d" %(tripbest[0]))
      plt.axis('off')
      plt.subplot(1,3,2)
      plt.imshow(imgLabels[0,:,:], cmap='gray' , interpolation='nearest') 
      plt.title("K = %d (BestModel)" %(tripbest[1]))
      plt.axis('off')
      plt.subplot(1,3,3)
      plt.imshow(imgLabelsnex[0,:,:], cmap='gray' , interpolation='nearest') 
      plt.title("K = %d" %(tripbest[2]))
      plt.axis('off')
      plt.show()
      
      In [ ]:
      plt.axis('off')
      plt.imshow(img[1], interpolation='sinc', cmap= 'spectral')
      plt.title("Layer 1 of the first image")
      plt.show()
      
      bite

      With the default color scheme, this figure is quite difficult to understand and to distinguish the groups according to their similar colors. So, we should have a smater color selection. The fact is, when we do clustering, it is often the case that some centers are more similar to one another, and it can be easier to interpret the results if the colors are choosen based on these relative similarities. The method optimize tries to find a set of colors such that similaries among colors match similarities among an input array (in this case, the cluster centers). The optimization is non-unique, so you can run multiple times to generate different color schemes.

      In [145]:
      from numpy import arctan2, sqrt, pi, abs, dstack, clip, transpose, inf, \
          random, zeros, ones, asarray, corrcoef, allclose, maximum, add, multiply, \
          nan_to_num, copy, ndarray, around, ceil, rollaxis
      
      # these functions below are inspired mainly from Thunder-Project source code, v.0.6
      # url: https://raw.githubusercontent.com/thunder-project/thunder/branch-0.6/thunder/viz/colorize.py
      
      # Optimal colors based on array data similarity.
      def optimize_color(mat):
              mat = np.asarray(mat)
      
              if mat.ndim < 2:
                  raise Exception('Input array must be two-dimensional')
      
              nclrs = mat.shape[0]
      
              from scipy.spatial.distance import pdist, squareform
              from scipy.optimize import minimize
      
              distMat = squareform(pdist(mat, metric='cosine')).flatten()
      
              optFunc = lambda x: 1 - np.corrcoef(distMat, squareform(pdist(x.reshape(nclrs, 3), 'cosine')).flatten())[0, 1]
              init = random.rand(nclrs*3)
              bounds = [(0, 1) for _ in range(0, nclrs * 3)]
              res = minimize(optFunc, init, bounds=bounds, method='L-BFGS-B')
              newClrs = res.x.reshape(nclrs, 3).tolist()
      
              from matplotlib.colors import ListedColormap
      
              newClrs = ListedColormap(newClrs, name='from_list')
      
              return newClrs
      
      # Blend two images together using the specified operator.
      def blend(img, mask, op=add):
              if mask.ndim == 3:
                  for i in range(0, 3):
                      img[:, :, :, i] = op(img[:, :, :, i], mask)
              else:
                  for i in range(0, 3):
                      img[:, :, i] = op(img[:, :, i], mask)
              return img
      
      def _prepareMask(mask):
              mask = asarray(mask)
              mask = clip(mask, 0, inf)
      
              return mask / mask.max()
          
      # Colorize numerical image data.
      def transform(cmap, img, mask=None, mixing=1.0):
              from matplotlib.cm import get_cmap
              from matplotlib.colors import ListedColormap, LinearSegmentedColormap, hsv_to_rgb, Normalize
      
              img = asarray(img)
              dims = img.shape
      
              if cmap not in ['polar', 'angle']:
      
                  if cmap in ['rgb', 'hv', 'hsv', 'indexed']:
                      img = copy(img)
                      for i, im in enumerate(img):
                          norm = Normalize(vmin=None, vmax=None, clip=True)
                          img[i] = norm(im)
      
                  if isinstance(cmap, ListedColormap) or isinstance(cmap, str):
                      norm = Normalize(vmin=None, vmax=None, clip=True)
                      img = norm(copy(img))
      
              if mask is not None:
                  mask = _prepareMask(mask)
      
              if isinstance(cmap, ListedColormap):
                  if img.ndim == 3:
                      out = cmap(img)
                      out = out[:, :, :, 0:3]
                  if img.ndim == 2:
                      out = cmap(img)
                      out = out[:, :, 0:3]
              else:
                  raise Exception('Colorization method not understood')
      
              out = clip(out, 0, 1)
      
              if mask is not None:
                  out = blend(out, mask, multiply)
      
              return clip(out, 0, 1)
      
      
      # generate the better color scheme
      newClrs = optimize_color(bestModel.centers)
      plt.gca().set_color_cycle(newClrs.colors)
      plt.plot(np.array(bestModel.centers).T);
      
      # draw image with the new color scheme
      brainmap = transform(newClrs, imgLabels[0,:,:])
      draw_image(brainmap)
      plt.show()
      

      f. Improving the result by removing noise

      One problem with what we've done so far is that clustering was performed on all time-series without data pre-processing. Many of time-series objects were purely noise (e.g. those outside the brain), and some of the resulting clusters capture these noise signals. A simple trick is to perform clustering after subselecting pixels based on the standard deviation of their time series. First, let's look at a map of the standard deviation, to find a reasonable threshold that preserves most of the relavant signal, but ignores the noise.

      Question 16

      Try with different threshold of standard deviation to filter the noise. What is the "best value" that preserves most of the relavant signal, but ignores the noise ? Why ?
      In [111]:
      # calculate the standard deviation of each voxel 
      # then collect to the driver
      stdMap = data.map(lambda x: x.std()).flatten().toarray().reshape((2, 76, 87))
      thresh = 0.02
      
      # visualize the map of the standard deviation after filtering
      draw_image(stdMap[0,:,:] > thresh)
      draw_image(stdMap[0,:,:])
      draw_image(first_image[0])
      plt.show()
      
      Without the threshold in the first image we would see many white pixels which fall outside of the brain zone in the real image (which in our case is the noise). We would like a tradeoff between keeping the most of information within the zone of the brain and eliminating the noise. Our new threshold seems to be a good choice, as we can see that it produces a good mask with respect to the second image (std without threshold).

      Question 17

      Filter your data such that we only keep the voxels that have the standard deviation bigger than the threshold in question 16.
      In [217]:
      from numpy import std
      
      # remove series object that has the standard deviation bigger than a threshold
      filtered = data.filter(lambda x : x.std()>thresh).tordd().map(lambda x: np.array(x[1]))
      filtered.cache()
      print(data.shape)
      print(filtered.count())
      
      (2, 76, 87, 20)
      2227
      

      Question 18

      Re-train and choose the best models with different values of `K` on the new data.
      In [161]:
      f_models = buildModels(filtered)
      testAndPlotTheResult(filtered, f_models)
      
      As we can see, the result here is different: the similarity error curve is way flatter. In the first case, where data wasn't filtered, we had many outliers which means many points that were poorly correlated with the centroids. By removing noise, we remove the 'chaotic' elements in our set, which results in low evolution for this metric.
      In [107]:
      f_SS = []
      fX_s = np.array(filtered.collect())
      for mod in f_models:
          t0=time()
          predictions = np.array(filtered.map(lambda x : mod.predict(x)).collect())
          f_SS.append(silhouette_score(fX_s, predictions))
          print("Done in %.2f" %(time()-t0))
      
      Done in 0.61
      Done in 0.79
      Done in 0.78
      Done in 0.92
      Done in 1.10
      Done in 1.58
      
      /opt/conda/lib/python3.5/site-packages/numpy/core/_methods.py:59: RuntimeWarning: Mean of empty slice.
        warnings.warn("Mean of empty slice.", RuntimeWarning)
      /opt/conda/lib/python3.5/site-packages/numpy/core/_methods.py:70: RuntimeWarning: invalid value encountered in double_scalars
        ret = ret.dtype.type(ret / rcount)
      
      Done in 2.61
      Done in 4.80
      
      In [108]:
      (x1s,y1s) = plotEnd(f_SS)
      plt.figure(figsize=(15,6))
      plt.plot(ks, f_SS, 'k-o',)
      plt.plot(x1s, y1s, 'k--')
      plt.title('Avg silhouette score')
      plt.show()
      
      We don't understand why the average silhouette is going down..

      Question 19

      a) Plot the centroids of the best model with a smart color selection. b) Plot the result of the clustering algorithm by a color map of voxels. c) Comment about your figures.
      In [153]:
      new_bestModel = f_models[5] 
      newClrs_0 = optimize_color(new_bestModel.centers)
      
      plt.figure(figsize=(15,6))
      plt.gca().set_color_cycle(newClrs_0.colors)
      plt.plot(np.array(new_bestModel.centers).T);
      plt.show()
      
      In [154]:
      # predict the nearest cluster id for each voxel in Series
      new_labels = new_bestModel.predict(training_data).collect()
      
      # collect data to the driver
      new_imgLabels = np.array(new_labels).flatten().reshape(2, 76, 87)
      
      # consider the voxel of the first layers
      # draw_image(new_imgLabels[0,:,:])
      new_brainmap = transform(newClrs_0, new_imgLabels[0,:,:])
      draw_image(new_brainmap)
      draw_image(brainmap)
      plt.show()
      
      We recognize some of the brain activiy we've seen before (first image is produced with filtered, second is the one we produced before), the most inactive zones were left out. This is probably due to the fact that the data points that we have eliminated were already clustered together in the model before (as we can see on the first color brainmap we produced).

      g. Improve the visualization by adding similarity

      These maps are slightly odd because pixels that did not survive our threshold still end up colored as something. A useful trick is masking pixels based on how well they match the cluster they belong to. We can compute this using the similarity method of KMeansModel.

      In [155]:
      sim = data.map(lambda x: similarity(new_bestModel.centers, x))
      
      imgSim = sim.toarray()
      
      # draw the mask
      draw_image(imgSim[0,:,:], cmap='gray', clim=(0,1))
      plt.show()
      

      And, it can be used as a linear mask on the colorization output

      In [156]:
      brainmap_0 = transform(newClrs_0, new_imgLabels[0,:,:], mask=imgSim[0,:,:])
      draw_image(brainmap_0)
      plt.show()
      

      Question 20

      Since in the usecase we build and test the model from the same data, it can lead to overfitting problems. To avoid that, we can divide the data into training set and testing set. Note that each neuron occurs only one time in the data. So, we can not divide the data by dividing the neurons. Instead, we can divide the states of neurons into two different sets. Let's try with this approach and show the result.
      As we try to look for neurons that have similarities in their activities, we should try to take some out to see how it influences the model. Also, we could do a cross validation by training our model on some neurons and use other neurons as test and see if those are in the salme cluster. We will separate our data using the filtered dataset as before, because it concerns only the brain (we know that the data left out were clustered together anyway). Since the split is random, we will run it 6 times to average the results.
      In [171]:
      E1_av = np.zeros(8)
      E2_av = np.zeros(8)
      for j in range(6):
          [training_set, test_set] = filtered.randomSplit([0.8,0.2])
          cross_models = buildModels(training_set) #models built with 80% of data
          errors_1 = np.asarray([model_error_1(test_set, cross_models[i]) for i in range(len(cross_models))])
          E1 = errors_1 / errors_1.sum()
          E1_av = np.add(E1_av, E1)
          errors_2 = np.asarray([model_error_2(test_set, cross_models[i]) for i in range(len(cross_models))])
          E2 = errors_2 / errors_2.sum()
          E2_av = np.add(E2_av, E2)
          print("Random %d" %(j))
      E1_av = np.divide(E1_av,6)
      E2_av = np.divide(E2_av,6)
      (x1,y1) = plotEnd(E1_av)
      (x2,y2) = plotEnd(E2_av)
      plt.figure(figsize=(15,6))
      plt.plot(
                ks, E1_av, 'k-o',
                ks, E2_av, 'b:v')
      plt.show()
      plt.figure(figsize=(15,6))
      plt.subplot(1,2,1)
      plt.plot(
                ks, E1_av, 'k-o',
                x1, y1, 'k--')
      plt.subplot(1,2,2)
      plt.plot(
                ks, E2_av, 'b:v',
                x2, y2, 'b--')
      plt.show()
      
      Random 0
      Random 1
      Random 2
      Random 3
      Random 4
      Random 5
      
      We can see that the curves are similar to those obtained with 100% of the filtered dataset. This shows that we can in fact remove some parts of the original dataset, since they do not influence the obtained results.

      Question 21

      Is using K-Means the best choice for a clustering algorithm? Comment the choice and suggest alternatives. For example, look at [Mixture Models](https://en.wikipedia.org/wiki/Mixture_model) and, if you have time, propose an alternative clustering technique.

      NOTE
      Mixture models will be covered in the ASI course in greater detail.

      The KMeans approach is a limited version of Mixture Models: in fact in Kmeans, for a given point, its cluster is chosen exclusively regarding the euclidean distance from it to the cluster center, whereas in a Mixture Model, in addition to that information, we use the cluster density with the Gaussians by estimating probability distributions.
      In [218]:
      k_val = 30
      from pyspark.mllib.clustering import GaussianMixtureModel, GaussianMixture
      GMmodel = GaussianMixture.train(filtered, k_val, convergenceTol=0.0001, maxIterations=150)
      
      In [219]:
      gmm_centers = [GMmodel.gaussians[i].mu for i in range(len(GMmodel.weights))]
      
      newClrs_G = optimize_color(gmm_centers)
      plt.gca().set_color_cycle(newClrs_G.colors)
      plt.plot(np.array(gmm_centers).T)
      
      
      GM_labels = GMmodel.predict(training_data).collect()
      
      #KMeans_imgLabels = np.array(new_labels).flatten().reshape((2, 76, 87))
      
      GM_imgLabels = np.array(GM_labels).reshape(2, 76, 87)
      
      
      #KMean_brainmap = transform(newClrs, KMeans_imgLabels[0,:,:])
      #draw_image(KMean_brainmap)
      #plt.show()
      
      #GM_labels.take(10)
      #print (KMeans_imgLabels[0])
      
      GM_brainmap = transform(newClrs_G, GM_imgLabels[0,:,:])
      draw_image(GM_brainmap)
      plt.show()
      
      We cannot really see the 30 clusters here, we don't understand why.
      In [220]:
      sim = data.map(lambda x: similarity(gmm_centers, x))
      
      imgSim = sim.toarray()
      
      # draw the mask
      GM_brainmap1 = transform(newClrs_G, GM_imgLabels[0,:,:], mask=imgSim[0,:,:])
      draw_image(GM_brainmap1)
      draw_image(brainmap_0)
      
      plt.show()
      
      It seems that with a smaller K, the mixture model produces similar results. However, we couldn't find a good way to measure the efficiency of this model throughout the different values of K.
      In [201]:
      plt.figure(figsize=(7,7))
      plt.axis('off')
      plt.imshow(img[0], interpolation='sinc', cmap= 'spectral')
      plt.title("Layer 1 of the first image")
      plt.show()
      
      We plotted the image that we are trying to approximate with our clustering. We can see that it's quite similar : the main activity areas are visible in our generated images of the brain. But we miss several areas and nuances.

      4. Summary

      We studied Thunder and its important methods to work with images, such as Image, Series and how to apply them to a use case. In the use case, we used the K-Means algorithm to cluster the neurons without prior knowledge of what a good choice of K could be. Subsequently, we introduced some techniques for improving the initially obtained results, such as removing noise and considering similarity.

      References

      Some of the examples in this notebook are inspired from the documentation of Thunder.